Heuristic shortest hyperpaths in cell signaling hypergraphs

Background Cell signaling pathways, which are a series of reactions that start at receptors and end at transcription factors, are basic to systems biology. Properly modeling the reactions in such pathways requires directed hypergraphs, where an edge is now directed between two sets of vertices. Inferring a pathway by the most parsimonious series of reactions corresponds to finding a shortest hyperpath in a directed hypergraph, which is NP-complete. The current state-of-the-art for shortest hyperpaths in cell signaling hypergraphs solves a mixed-integer linear program to find an optimal hyperpath that is restricted to be acyclic, and offers no efficiency guarantees. Results We present, for the first time, a heuristic for general shortest hyperpaths that properly handles cycles, and is guaranteed to be efficient. We show the heuristic finds provably optimal hyperpaths for the class of singleton-tail hypergraphs, and also give a practical algorithm for tractably generating all source-sink hyperpaths. The accuracy of the heuristic is demonstrated through comprehensive experiments on all source-sink instances from the standard NCI-PID and Reactome pathway databases, which show it finds a hyperpath that matches the state-of-the-art mixed-integer linear program on over 99% of all instances that are acyclic. On instances where only cyclic hyperpaths exist, the heuristic surpasses the state-of-the-art, which finds no solution; on every such cyclic instance, enumerating all source-sink hyperpaths shows the solution found by the heuristic was in fact optimal. Conclusions The new shortest hyperpath heuristic is both fast and accurate. This makes finding source-sink hyperpaths, which in general may contain cycles, now practical for real cell signaling networks. Availability Source code for the hyperpath heuristic in a new tool we call Hhugin (as well as for hyperpath enumeration, and all dataset instances) is available free for non-commercial use at http://hhugin.cs.arizona.edu.

In the field of algorithms, Italiano and Nanni [14] first proved that finding a shortest source-sink hyperpath is NP-complete, even when hyperedges have a single head vertex. In a seminal paper that is the source for much of the subsequent work on hypergraphs, Gallo, Longo, Pallottino and Nguyen [15] explore special cases of hypergraphs, and define several versions of hyperpaths, including what they call a B-path (though see the correction of Nielsen and Pretolani [24]), which is essentially equivalent to our definition of hyperpath (given in the following section on shortest hyperpaths in directed hypergraphs). They show the vertices reachable from a source vertex in a hypergraph can be found in time linear in the total size of the tail and head sets of all hyperedges, give an efficient algorithm for a variant of shortest hyperpaths with a so-called additive cost function, and prove that finding a minimum cut in a hypergraph is NP-complete. Ausiello and Laura [16] survey results on hypergraphs whose hyperedges have singleton head sets, and note that a consequence of the NP-completeness reduction [14] for shortest hyperpaths from the set cover problem is that, unless P = NP , no approximation algorithm can exist for shortest hyperpaths on hypergraphs of n vertices with approximation ratio 1−o(1) ln n.
In metabolic networks, Cottret, Milreu and Acuña et al. [17] examine the minimum precursor problem: given a hypergraph G, a set of sources S, and a set of targets T, find a source subset P ⊆ S of minimum cardinality that has a factory from P to T, where a factory is a set of hyperedges that produce targets T from precursor set P while satisfying weaker ordering constraints on hyperedges than required by hyperpaths. They show this problem is NP-complete, and give an algorithm that enumerates all minimal precursor sets whose factory is acyclic. Acuña, Milreu and Cottret et al. [18] subsequently enumerate all minimal precursor sets allowing cycles. Andrade, Wannagat and Klein et al. [19] extend these algorithms to accommodate stoichiometry and conserve intermediate metabolites within the factory. Carbonell, Fichera, Pandit and Faulon [20] give an efficient algorithm to find a source-sink hyperpath if one exists-irrespective of its length-and prove that finding any hyperpath that must contain a specified set of hyperedges is NP-complete. They also offer an approach to hyperpath enumeration that relies on solutions to this NP-complete problem, for which they employ a heuristic.
In cell-signaling networks, Ritz, Avent and Murali [12,21] were the first to solve the shortest acyclic hyperpath problem by formulating it as a mixed-integer linear program (MILP)-the current state-of-the-art for shortest hyperpaths-and showed that in practice, optimal acyclic hyperpaths can be found even for large cell-signaling hypergraphs. Their formulation does not extend to hyperpaths with cycles, and requires exponential time in the worst-case (which may be unavoidable, as the acyclic problem remains NP-complete). Recently, Franzese, Groce, Murali and Ritz [22] defined a parameterized notion of connectivity that interpolates between hyperpath-and ordinary-path-connectivity, while Schwob, Zhan and Dempsey [23] modified the acyclic MILP of Ritz et al. [21] to include time-dependence among reactions.

Our contributions
In contrast to prior work, we present a heuristic for shortest hyperpaths that handles cycles, is worst-case efficient, and finds hyperpaths that are demonstrably optimal or close to optimal in real cell-signaling hypergraphs. In more detail, we make the following contributions.
• We present an efficient heuristic for shortest hyperpaths, that on a hypergraph of size ℓ , which measures the total cardinality of all hyperedge tail and head sets, with m hyperedges that are doubly-reachable from the source and sink vertices, and k defined analogously to ℓ over these doubly-reachable hyperedges, runs in O(ℓ + m 2 k) time.
• We prove that the heuristic finds an optimal shortest hyperpath for the class of singleton-tail hypergraphs, where the tails of all hyperedges in the hypergraph are single vertices.
• We also give a practical algorithm for hyperpath enumeration that generates all possible source-sink hyperpaths, allowing us to tractably measure how close our heuristic is to the optimum.
• Our heuristic matches the state-of-the-art MILP for shortest acyclic hyperpaths on over 99% of all instances from two standard databases of cell-signaling pathways.
• Our heuristic surpasses the state-of-the-art on instances where every source-sink hyperpath is cyclic, and hence the MILP finds no solution. On all such cyclic biological instances, our hyperpath enumeration algorithm verified that the heuristic was in fact optimal.
To our knowledge, this heuristic is the first in the literature for shortest source-sink hyperpaths in general directed hypergraphs, where hyperedges have arbitrary tail and head sets, and the length of a hyperpath is the sum of the weights of its hyperedges. We note that the worst-case efficiency and subclass optimality of the heuristic highlighted in the first two points above show that the shortest hyperpaths problem is polynomial-time solvable for singleton-tail hypergraphs-in contrast to its NP-completeness for singleton-head hypergraphs [14]-which does not appear to have been observed before in the literature [16]. Furthermore, while prior work has developed specialized algorithms that are tailored to shortest hyperpaths under so-called additive cost functions [15]-which also handle singleton-tail hypergraphs-in distinction, we give a general heuristic for arbitrary hypergraphs under the non-additive cost function of total weight of the hyperpath, that as a consequence is optimal for the special case of singleton-tail hypergraphs.
Source code for an implementation of the shortest hyperpath heuristic in a new tool we call Hhugin [25] (short for "hypergraph heuristic for general shortest source-sink hyperpaths"), as well as the hyperpath enumeration algorithm and all dataset instances, is available free for non-commercial use at http://hhugin.cs.arizona.edu.

Plan of the paper
The next section defines the general shortest hyperpath problem, allowing cycles. The following section then presents our heuristic for shortest hyperpaths, analyzes its time complexity, shows it returns a feasible solution whenever one exists, and proves it finds optimal solutions for singleton-tail hypergraphs. The next section gives our algorithm for generating all source-sink hyperpaths, proves its correctness, and analyzes its time complexity. The subsequent section compares the heuristic, through experiments on all source-sink instances from standard databases, to the state-of-the-art MILP for acyclic instances, or to the optimum of all enumerated hyperpaths for cyclic instances, and discusses three examples of cyclic shortest hyperpaths in cell signaling networks. Finally, the last section concludes, and provides directions for further research.

Shortest hyperpaths in directed hypergraphs
A directed hypergraph is a generalization of an ordinary directed graph, where an edge, instead of touching two vertices, now connects two subsets of vertices. Formally, a directed hypergraph is a pair (V, E), where V is a set of vertices, and E is a set of directed hyperedges. (The literature sometimes uses the term hyperarc for an edge in a directed hypergraph, but we prefer the simpler term hyperedge-just as the term edge is conventionally used for both directed and undirected ordinary graphs. We will occasionally abbreviate the term hyperedge to simply edge, when it is clear that the context is with respect to a directed hypergraph.) Each hyperedge e ∈ E is an ordered pair (X, Y), where both X, Y ⊆ V are vertex subsets. Edge e is directed from set X to set Y. We call set X the tail of e, and set Y the head of e, and refer to these sets by the functions tail(e) = X and head(e) = Y . We also refer to the in-edges of vertex v by in(v) := {e ∈ E : v ∈ head(e)} ,   17:12 and the out-edges of v by out(v) := {e ∈ E : v ∈ tail(e)} . Figure 1 shows a directed hyperedge. In ordinary directed graphs, a path from a vertex s to a vertex t is a sequence of edges starting from s and ending at t, where for consecutive edges e and f in the sequence, the preceding edge e must enter the vertex that the following edge f leaves. We say t is reachable from s when there is such a path from s to t.
In generalizing these notions to directed hypergraphs, the conditions both for when a hyperedge can follow another in a hyperpath, and when a vertex is reachable from another, become more involved. A hyperpath is again a sequence of hyperedges, but now for hyperedge f in a hyperpath, for every vertex v ∈ tail(f ) , there must be some hyperedge e that precedes f in the hyperpath for which v ∈ head(e) . Reachability is captured by the following notion of superpath.

Definition 1 (Superpath)
In a directed hypergraph (V, E), an s,t-superpath, for vertices s, t ∈ V , is an edge subset F ⊆ E such that the hyperedges of F can be ordered e 1 , e 2 , . . . , e k , where (iii) and t ∈ head(e k ).
For an s,t-superpath, we call s its source, t its sink, and we say t is reachable from s.
We can now define hyperpaths in terms of superpaths. Recall that a set S is minimal with respect to some property X if S satisfies X, but no proper subset of S satisfies X.

Definition 2 (Hyperpath) An s,t-hyperpath is a minimal s,t-superpath.
In other words, a hyperpath P is a superpath for which removing any edge e ∈ P leaves a subset P − {e} that is no longer a superpath. Essentially, hyperpaths eliminate unnecessary edges from superpaths. Figures 7,8,and 9 later show examples of hyperpaths.
We say a hyperpath P contains a cycle if, for every ordering e 1 , . . . , e k of its hyperedges satisfying properties (i)-(iii) in the definition of superpath, P contains some hyperedge f with a vertex in head(f ) that also occurs in tail(e) for an earlier hyperedge e in the ordering. While in ordinary graphs a minimal s,t-path can never contain a cycle, in hypergraphs an s,t-hyperpath can in fact contain cycles, as shown in our later section on biological examples.
We can now define the shortest hyperpath problem. For an edge weight function ω(e) , we extend ω to edge subsets F ⊆ E by ω(F ) := e∈F ω(e).

Definition 3 (Shortest Hyperpaths)
The Shortest Hyperpaths problem is the following. Given a directed hypergraph (V, E), a positive edge weight function ω : E → R + , source s ∈ V and sink t ∈ V , find an s,thyperpath P ⊆ E of minimum total weight ω(P).
Note that for positive edge weights, Shortest Hyperpaths is equivalent to finding an s,t-superpath of minimum total weight.
Shortest Hyperpaths with a single source and sink vertex also captures more general versions of the problem with multiple sources and multiple sinks, as follows. To find a hyperpath that starts from a set of sources S ⊆ V , simply add a new source vertex s to the hypergraph together with a single hyperedge ({s}, S) of zero weight, and equivalently find a hyperpath from the single source s. To find a hyperpath that reaches all vertices in a set of sinks T ⊆ V , add a new sink vertex t, a zero-weight hyperedge (T , {t}) , and equivalently find a hyperpath to the single sink t. To find a hyperpath that reaches some vertex in a set of sinks T ⊆ V , add new sink vertex t, zero-weight hyperedges ({v}, {t}) from all v ∈ T , and again equivalently find a hyperpath to the single sink t. Thus versions of shortest hyperpaths with multiple sources and sinks can be reduced to the problem with a single source and sink.
Shortest Hyperpaths is NP-complete [14] (even for acyclic hypergraphs with singleton head sets), so we likely cannot efficiently compute shortest hyperpaths in the worst-case. The next section presents an efficient heuristic for shortest hyperpaths that is highly accurate at finding demonstrably optimal or near-optimal hyperpaths in real cell-signaling hypergraphs.

An efficient shortest hyperpath heuristic
We now give a fast heuristic for Shortest Hyperpaths that always finds an s,t-hyperpath if one exists. While the heuristic is not guaranteed to find a shortest s,thyperpath in general, our later experiments on real cellsignaling hypergraphs show it quickly finds a hyperpath that is optimal or remarkably close to optimal on the vast majority of instances in comprehensive experiments over the two standard cell-signaling databases in the literature. Furthermore, we will prove that the heuristic is guaranteed to find a shortest s,t-hyperpath for the class of singleton-tail hypergraphs, where the tail-sets of all hyperedges are single vertices.
We present the heuristic by providing detailed pseudocode at a level that can be directly implemented, as the heuristic is carefully designed and many of its component algorithms are surprisingly tricky to implement correctly. After describing the heuristic, we give a time analysis that shows it is always efficient, prove its feasibility, and then show that it finds optimal hyperpaths for singletontail hypergraphs.
While at a high level the heuristic has some aspects in common with Dijkstra's algorithm for single-source shortest paths in an ordinary directed graph (see [26, pp. 658-659])-in that the heuristic maintains a heap of elements prioritized by estimated path lengths-it has significant differences. In contrast to Dijkstra's algorithm, the heuristic is edge-based, rather than vertex-based, and the heap maintains hyperedges e prioritized by the length of the shortest known hyperpath from the source s to edge e, which will be formally defined later. Also in contrast to Dijkstra's algorithm, maintaining a single inedge to a vertex no longer suffices for recovering a path back to source s; instead, recovering an s,t-hyperpath now requires the heuristic to maintain a set of in-edges to each hyperedge e that are candidates for the final edges on the path from s to e. Furthermore, the total length of a hyperpath P to e is no longer a simple function (like a minimum or a sum) of the lengths of hyperpaths to the in-edges of e in P that cover the tail of e, since the constituent hyperpaths within P to these in-edges of e can have arbitrarily-complicated sharing of hyperedges. Simply determining the length of the best recovered hyperpath for a hyperedge e on the heap, using these stored in-edges to each hyperedge, is itself now a hard combinatorial problem, which the heuristic tackles by a carefully-constructed greedy procedure.
The overall structure of the heuristic is a breadth-first search over the hyperedges e reachable from source s, ordered by the estimated length of the shortest hyperpath from s to e. (Admittedly a shortest s,t-hyperpath P is not necessarily composed of shortest hyperpaths from s to individual hyperedges e in P, which is partly why this approach is a heuristic.) The search repeatedly invokes a greedy procedure to recover the currently best-known hyperpath to e in order to evaluate its length. As hyperpaths are by definition minimal superpaths, to determine minimality this greedy recovery procedure repeatedly tests reachability of hyperedges. Moreover, for efficiency, the overall breadth-first search proceeds over a smaller subgraph of the input hypergraph that only contains hyperedges that are reachable both from source s and in reverse from sink t. Hence at base, the heuristic builds upon fast algorithms for computing reachability in a hypergraph.
Accordingly, to present the heuristic, we first give pseudocode for these fundamental algorithms for directed reachability. These algorithms use the following terminology of forward-reachable, backward-traceable, and doubly-reachable, which we define next.

Definition 4 (Reachability and Traceability
or recursively if v ∈ tail(e) for an edge e where some w ∈ head(e) is backward traceable from t. Hyperedge e is backward traceable from t if some v ∈ head(e) is backward traceable from t.
A vertex v or hyperedge e is doubly reachable if v or e, respectively, is both forward reachable from s and backward traceable from t.
To describe the heuristic, it will also be convenient to extend the definitions of superpath and hyperpath to a path from a source s to a hyperedge e. Definition 5 (Superpath and Hyperpath from Source to Hyperedge) An s,e-superpath is an edge subset S with e ∈ S where all vertices in tail(e) are forward reachable from source s using hyperedges in S. An s,e-hyperpath is a minimal s,e-superpath.
The pseudocode that we present accesses a hypergraph G through the fields G.vertices and G.edges. We access the tail-set and head-set of a hyperedge e through the fields e.head and e. tail. We access the set of in-edges and out-edges of a vertex v through the fields v.in and v.out. For a list Q that is handled as a queue, the operation Q. Put(x) appends item x to the rear of Q, while the operation Q.Get() removes and returns the item at the front of Q. For a min-heap H, the operation H.Insert(x, k) inserts item x with key k into H, and returns a reference p to the heap node containing this pair (x, k) in H; the operation H.Extract() removes and returns the item in H with minimum key; and the operation H.Decrease(p, k) takes a reference p to a heap node in H and decreases its key to k if k is smaller than its current key. All functions assume hypergraph G is passed by reference.   Figure 2 gives pseudocode for the two functions ForwardReachable and BackwardTraceable. Function ForwardReachable returns the set of all hyperedges that are forward reachable from source s, while function BackwardTraceable returns the set of hyperedges that are backward traceable from sink t. Function ForwardReachable uses the Boolean vertex field v.reached, and the integer edge field e.count, which it assumes have already been initialized to the values v.reached = false for all v ∈ V and e.count = tail(e) for all e ∈ E . Function BackwardTraceable also uses the Boolean edge field e.marked, which it similarly assumes is initialized to false for all e. (This initialization will be done once for hypergraph G in the shortest hyperpath heuristic, which allows these functions when called repeatedly to run in time bounded by just the size of the forward-reachable or backward-traceable subgraphs.) Function ForwardReachable uses the field e.count to detect when all vertices in tail(e) have been reached from s, and hence e is now reached from s. Function BackwardTraceable performs a similar but simpler computation in reverse from sink t. The worst-case time for both these functions is linear in the size of the subgraph they explore, as analyzed in the following section on the time-complexity of the heuristic. Figure 3 gives pseudocode for the function Short-estHyperpathHeuristic, our heuristic. Like Dijkstra's shortest path algorithm for ordinary graphs, this function maintains a heap H, but in contrast to Dijkstra's algorithm this is now a heap of hyperedges e (rather than a heap of vertices), which are prioritized by keys that are the best known estimate of the length of a shortest s,e-hyperpath. We refer to this estimate as the current path length for e. The heuristic starts from the out-edges of source s, and in a reaching computation repeatedly extracts from heap H the hyperedge e with minimum key. When hyperedge e is removed from H, the estimated path length for e is evaluated, and stored in field e.length. To compute this length estimate, it must construct the best s,e-hyperpath it can find, and evaluate its total weight. Of course, computing an optimal s,e-hyperpath is NP-complete, so it uses a greedy heuristic to construct this path by the function RecoverShortPath. This greedy pathconstruction heuristic consists of two steps: (1) recovering an s,e-superpath by recursively backward-traversing hyperedges that enter tail(e) , followed by (2) finding a minimal subset of this superpath that is an s,e-hyperpath while attempting to minimize its total weight. Figure 4 gives pseudocode for the function Recov-erShortHyperpath that implements this greedy path-construction heuristic. For the first step, recovering the s,e-superpath S is done by recursively backward-traversing what we call in-edges: those hyperedges whose head-sets intersect the tail-set of a given hyperedge. Function ShortestHyperpathHeuristic maintains for a hyperedge e the field e.inedges, which stores the subset of in-edges f to e whose field f. length has been determined.
For the second step, function RecoverShortHyperpath attempts to find the minimum weight subset of S that is still a superpath by greedily considering hyperedges f ∈ S for removal in decreasing order of f. length, which is the estimated total length of a shortest s,f-hyperpath. (Note this is more sophisticated than a naive greedy approach that simply removes hyperedges f in decreasing order of their edge-weight ω(f ) , which would degenerate to removing edges in random order in real cell-signaling networks where hyperedges typically all have unit weight, and hence would all be tied for removal.) This greedy process for trimming superpath S repeatedly tests whether tail(e) is still reachable from s on removing f by calling Boolean function IsReachable. Pseudocode for IsReachable is not given, but it simply implements a version of function ForwardReachable that halts and returns true as soon as it adds e to the set of hyperedges reachable from s, or returns false after collecting the entire reachable set without encountering e.
We note that most of the computation of the shortest hyperpath heuristic proceeds over a much smaller subgraph of the input hypergraph G: namely the subgraph induced by the hyperedges D ⊆ E that are doubly reachable (both forward reachable from s and backward traceable from t). This preserves correctness, since hyperedges that are not doubly reachable cannot be on an s,t-hyperpath and can safely be ignored (as argued in the later section on feasibility of the heuristic in the proof of Theorem 2).
To summarize, the shortest hyperpath heuristic proceeds greedily like Dijkstra's algorithm, but with some important differences: it maintains a heap of hyperedges prioritized by estimated shortest path lengths to tailsets, records a set of potential in-edges to a given hyperedge used for recovering a hyperpath to the hyperedge, and recovering such a hyperpath now involves another greedy heuristic to find a minimal superpath of small total weight.
Our later section on experimental results shows this heuristic is remarkably close to optimal on real cellsignaling hypergraphs. Given that no practical exact algorithm exists for general shortest hyperpaths, we determine the optimum by enumerating all s,t-hyperpaths and taking the minimum of their lengths, using an algorithm we develop in the later section on tractably generating all source-sink hyperpaths. (2022) 17:12 for e ∈ G.edges do (e.count, e.marked, e.node, e.inedges) := (|e.tail|, false, nil, ∅) • Recover a short hyperpath to e ... e.length := ω(P ) • ... and its path length • Recover the best s, t-hyperpath P * for e ∈ t.in do if e.node = nil then begin Restore to G all edges not in D • Unrestrict G and return the best hyperpath return P * end Fig. 3 Efficient heuristic for shortest source-sink hyperpaths. Given source s, sink t, and edge weights ω ,function ShortestHyperpathHeuristic finds an s,t-hyperpath in hypergraph G, attempting to minimize its length. If no s,t-hyperpath exists, the empty path is returned. For doubly-reachable hyperedges e, the heuristic maintains fields e.length (the total weight of the shortest hyperpath found to e), and e.inedges (the subset of edges f with head(f ) touching tail(e) where f.length is known), which are used in RecoverShortHyperpath to recover a short hyperpath to e We note for this heuristic that the inapproximability of the shortest hyperpath problem [16], together with the polynomial time analysis of the next subsection, imply that unless P = NP , the heuristic cannot be a constantfactor approximation algorithm for shortest hyperpaths.
In the following subsections, we first analyze the running time of the heuristic; then show it always finds a feasible solution whenever one exists; and finally prove it actually finds an optimal solution for the class of singleton-tail hypergraphs.

Time complexity of the heuristic
We now bound the time complexity of the shortest hyperpath heuristic. Our analysis is in terms of the following parameters measured on a hypergraph, or an induced subgraph. For a hypergraph G with vertices V and hyperedges E, we denote its number of vertices and hyperedges by Restore back to G all edges removed • Restore G and return hyperpath P return P end Fig. 4 Recovering a short hyperpath from the source to a hyperedge. Given source vertex s and hyperedge e, function RecoverShortHyperpath returns an s,e-hyperpath P in hypergraph G, attempting to minimize its length. The edges of hyperpath P are greedily selected from an s,e-superpath S that is guaranteed to exist in G, where S is recovered by tracing backward from e We also use the size parameter and degree parameter Note that in general, the space required to represent all hyperedges is �(ℓ) . We assume all tail and head sets are nonempty, and every vertex is touched by a hyperedge, which implies m + n = O(ℓ) . When we need to refer to these measures for a particular hypergraph G, such as on an induced subgraph, we explicitly subscript the parameters by the specific hypergraph, such as n G , . . . , d G , where n := |V | , m := |E| . these parameters are then measured in terms of the vertices and edges of the subscripted hypergraph G. The running time of the shortest hyperpath heuristic may be expressed as a function of parameters measured on both the input hypergraph and its doubly-reachable subgraph (induced by the hyperedges that are simultaneously forward reachable from the source and backward traceable from the sink). For ForwardReachable, let R ⊆ V be the set of vertices reachable from source s, and F ⊆ E be the set of hyperedges reachable from s that are returned. The total time for ForwardReachable is dominated by the time for its main while-loop, which takes time � v∈R out(v) + e∈F head(e) , or equivalently, For BackwardTraceable, let B ⊆ V be the set of vertices it reaches from sink t, and F ⊆ E be the set of hyperedges traceable from t that are returned. A similar analysis shows the time for BackwardTraceable is So the time for both ForwardReachable and Back-wardTraceable on the input hypergraph G is O ℓ Gbut can be bounded more tightly in terms of the subgraph of G they actually explore.
For the function RecoverShortHyperpath (in Fig. 4), when it is called by ShortestHyperpathHeuristic, all its computations are performed on G restricted to the edge subset D ⊆ E of doubly-reachable hyperedges. We denote by hypergraph H the doubly-reachable subgraph induced by D.
In RecoverShortHyperpath, the time to recover s,esuperpath S by tracing back from e is at most The time to greedily trim superpath S to s,e-hyperpath P ⊆ S , in terms of cardinality k = |S| , is at most Thus the total time for RecoverShortHyperpath is For the function ShortestHyperpathHeuristic (in Fig. 3 Finally, adding up the bounds for the above components, the total time for the shortest hyperpath heuristic is Notice that the overall running time of the heuristic is dominated by the total time to recover short hyperpaths, which requires invoking RecoverShortHyperpath whenever the path length to a hyperedge is updated. This is necessary in hypergraphs, since in contrast to ordinary graphs the length of the hyperpath to a hyperedge can no longer be expressed as a simple function (such as a minimum or a sum) of the lengths of the hyperpaths to its in-edges. As demonstrated in our later section on experimental results, for real biological instances the size of the doubly-reachable subgraph H is significantly smaller than the full input hypergraph G, so designing the heuristic to compute mainly over the much smaller hypergraph H yields a significant performance speedup in practice.
Next we show the heuristic always finds a feasible solution.

Feasibility of the heuristic
The most basic property that a heuristic for a combinatorial optimization problem should satisfy is feasibility: that it always returns a feasible solution whenever one exists. In the context of Shortest Hyperpaths, a feasible solution is any s,t-hyperpath, while an optimal solution is a feasible solution of minimum total edge-weight.
For the hyperpath heuristic, we now show feasibility.

Theorem 2 (Feasibility of the heuristic)
The shortest hyperpath heuristic finds a source-sink hyperpath whenever one exists.

Proof
Function ShortestHyperpathHeuristic (in Fig. 3 Furthermore, we claim that when restricting to the doubly-reachable subgraph G , the heuristic does not lose any hyperedges on source-sink hyperpaths. Note that any hyperedge e on an s,t-hyperpath P in the input hypergraph G is forward reachable from s: consider the ordering of hyperedges in P from Definition 1, and take the prefix of this ordering up through e; this prefix is an s,e-superpath, so e is by definition forward reachable from s. Note also that any e on P in G is backward traceable from t as well: if t ∈ head(e) , backward traceability immediately holds; otherwise, in the ordering of P there must be a hyperedge f following e with nonempty head(e) ∩ tail(f ) (else e can be removed from P, contradicting minimality); applying this same process again at f yields a subsequence of the ordering of P that ends in a hyperedge whose head contains t; considering this subsequence in reverse order satisfies Definition 4 for backward traceability of e from t. Hence restricting to the doubly-reachable subgraph G is safe.
To show the implication of the theorem, notice Short-estHyperpathHeuristic explores all hyperedges that are forward reachable from s in G , inserting hyperedge e into heap H when e is initially reached, again detecting when traversing e causes another hyperedge f to be first reached using counter f.count, and recording in field f.inedges all such e that have reached f. So if an s,t-hyperpath exists in G, which implies sink t has an inedge e that is forward reachable from s in G , this e will eventually be inserted into H, making e.node non-nil, and at the end of the heuristic causing RecoverShortHyperpath to be called on e.
We claim that when function RecoverShortHyperpath (in Fig. 4) is ultimately called on an in-edge to sink t, phase (I) first recovers an edge set S that is an s,t-superpath in G. Considering the hyperedges of S in reverse order of their removal from queue Q, they satisfy the three conditions for an s,t-superpath in Definition 1: the last hyperedge removed from Q solely has s in its tail, each hyperedge in S (other than this last one) has its tail set covered by hyperedges removed later from Q, and the first edge removed has t in its head.
Function RecoverShortHyperpath in phase (II) then trims S to a minimal s,t-superpath, yielding an s,thyperpath. Finally, ShortestHyperpathHeuristic returns the shortest such hyperpath found.
Thus whenever a source-sink hyperpath exists, the heuristic finds one.
Next we prove the heuristic actually solves Shortest Hyperpaths when the input is a singleton-tail hypergraph.

Optimality of the heuristic for singleton-tail hypergraphs
While our heuristic does not necessarily find shortest hyperpaths in general hypergraphs, we can prove that it does find optimal solutions for the following class of hypergraphs.
A singleton-tail hypergraph is a directed hypergraph G where every hyperedge e in G has tail(e) = 1 . (The head sets of hyperedges can be arbitrary.) In other words, in singleton-tail hypergraphs, the tails of hyperedges are single vertices.
At a high level, the optimality argument for singletontail hypergraphs first shows that shortest source-sink hyperpaths are composed of shortest s,e-hyperpaths; then argues that the heuristic's greedy superpath trimming recovers shortest s,e-hyperpaths when the hyperedge fields hold shortest hyperpath lengths; and finally proves that the heuristic computes exact shortest s,ehyperpath lengths.
The following characterization states that in singletontail hypergraphs, a shortest s,t-hyperpath is composed of shortest s,e-hyperpaths to its constituent hyperedges. This does not hold for general hypergraphs, and is partly why the special case of shortest singleton-tail hyperpaths is polynomial-time solvable.

Lemma 1 (Characterizing shortest singleton-tail hyperpaths)
In singleton-tail hypergraphs with nonnegative edge weights, every shortest s,t-hyperpath can be ordered as a sequence e 1 · · · e k of hyperedges where (i) each head(e i ) ⊇ tail(e i+1 ), and (ii) every prefix e 1 · · · e i is a shortest s, e i -hyperpath.

Proof
Consider a shortest s,t-hyperpath P in a singleton-tail hypergraph. By definition, P is a minimal s,t-superpath, so its edges can be ordered as a sequence e 1 · · · e k where tail(e 1 ) = {s} , head(e k ) ⊇ {t} , and since tail sets contain a single vertex, for every hyperedge e j in this sequence other than the first one, there is a prior hyperedge e i with head(e i ) ⊇ tail(e j ).
Clearly every prefix e 1 · · · e i is an s, e i -superpath. Moreover this prefix must be a minimal s, e i -superpath, otherwise P is not minimal. Thus every prefix ending in e i is an s, e i -hyperpath.
Finally, every prefix e 1 · · · e i must be a shortest s, e i -hyperpath. Otherwise, replacing this prefix by a shortest s, e i -hyperpath yields an s,t-superpath S of total weight less than P. Furthermore, trimming S to a minimal s,t-superpath under nonnegative edge weights yields an s,t-hyperpath of total weight less than P, contradicting the optimality of P.
In the following, the distance of hyperedge e from source s is the total weight of a shortest s,e-hyperpath, which we denote by d(e). Recall that function Short-estHyperpathHeuristic (in Fig. 3) maintains the field e.length, that holds the total weight of the bestknown s,e-hyperpath, which upper bounds d(e).
The next lemma states that in singleton-tail hypergraphs, given two key conditions, the greedy superpath trimming that is used by the heuristic to recover a hyperpath to hyperedge e in fact finds a shortest s,e-hyperpath. Then the hyperpath to e that the heuristic recovers is a shortest s,e-hyperpath.

Proof
We first claim that under the assumptions of the lemma, when the hyperpath heuristic calls RecoverShort-Hyperpath (in Fig. 4)

on a hyperedge e, its first phase recovers an s,e-superpath S that contains a shortest s,ehyperpath. By assumption (i), field e.inedges contains a hyperedge f on a shortest s,e-hyperpath, and f will be in superpath S, hence by assumption (ii), the value of f.length is d(f ). This value came from a shortest s,f-hyperpath Q that was found in a prior call to
RecoverShortHyperpath on f, by trimming an s,f-superpath T. Notice that Q followed by e is an s,e-superpath P , as head(f ) ⊇ tail(e) . Now trim P to an s,e-hyperpath P, and let P * be a shortest s,e-hyperpath containing f that exists by assumption (i). By Lemma 1 and minimality of hyperpaths, P * must consist of a shortest s,f-hyperpath Q * followed by e. Under nonnegative edge weights, Thus P is also a shortest s,e-hyperpath. Since f is in e.inedges, tracing back from e recovers the superpath so the claim holds.
We next claim that when RecoverShortHyperpath in its second phase greedily trims superpath S, the resulting superpath T ⊆ S still contains a shortest hyperpath. To show this, we prove that each superpath S i that remains after i iterations of greedy trimming contains a shortest s,e-hyperpath, by induction on i. For the basis at i = 0 , the initial superpath S 0 before any trimming contains a shortest hyperpath by our first claim on S. For the induction at i > 0 , let P be a shortest s,e-hyperpath that superpath S i−1 contains by our hypothesis, and let f be the hyperedge removed from S i−1 at iteration i. If f ∈ P , then S i = S i−1 − {f } trivially contains P. So we assume f ∈ P . In the following, the core of hyperpath P consists of the tail vertices of its hyperedges.
In an ordering of shortest hyperpath P that satisfies Lemma 1, consider the hyperedges in the suffix of P that begins with f. As edge weights are nonnegative, by Lemma 1 the distances of these hyperedges must be at least d(f), so by assumption (ii) the values of the length field for these hyperedges must be at least f.length. Greedy trimming proceeds in decreasing order of length-field values, so the hyperedges in this suffix of P must either have been already considered for trimming before f, or not yet considered due to being tied with f (from having zero edge-weight). If they were considered before f, then since they were not trimmed, there must be no alternate s,e-hyperpath in S i−1 that enters their head vertices on the core of P. If they were not considered yet, then since f can be removed from S i−1 , there must be an alternate s,e-hyperpath Q ⊆ S i distinct from P that enters one of the core head-vertices of the hyperedges in this suffix of P whose length field is tied with f. Moreover, this alternate hyperpath Q must enter P with the same length-field value as the edge of P sharing this core headvertex. (If Q enters P at a smaller length-value, then P is not a shortest s,e-hyperpath; if Q enters at a greater length-value, hyperedge f would not be the next hyperedge removed, as instead a hyperedge from Q of greater length would be.) Since Q enters P at the same lengthvalue, hyperpath Q is also a shortest s,e-hyperpath.
Hence S i ⊇ Q still contains a shortest hyperpath, which proves the second claim.

So the final trimmed s,e-superpath T returned by
RecoverShortHyperpath contains a shortest s,ehyperpath P ⊆ T . Since T is minimal (as no further edges could be trimmed), and P by definition is minimal, we must have T = P , which proves the lemma.
We now show that the hyperpath heuristic solves Shortest Hyperpaths for singleton-tail hypergraphs.

Theorem 3 (Optimality of the heuristic on singleton-tail hypergraphs)
For singleton-tail hypergraphs with nonnegative edge weights, the hyperpath heuristic finds a shortest source-sink hyperpath.

Proof
The key to proving optimality is showing that in singletontail hypergraphs, the estimates that the heuristic computes for shortest hyperpath lengths are exact. Recall that when function ShortestHyperpathHeuristic (in Fig. 3) removes hyperedge e from heap H, it calls Recov-erShortHyperpath on e to recover an s,e-hyperpath P, and sets the field e.length to ω(P) , the total weight of P.
We claim that when this assignment occurs, field e.length holds distance d(e), the total weight of a shortest s,ehyperpath. We now prove this claim by induction on the number of heap extractions. At a high level, the argument is similar to that for Dijkstra's shortest-path algorithm (see [26, pp. 659-661]) on ordinary directed graphs.
For the basis, the first hyperedge extracted has tail(e) = {s} and e.key = ω(e) , which equals d(e), as e itself is a shortest s,e-hyperpath (since all edge weights are nonnegative). The recovered s,e-hyperpath will consist of e (as e.inedges is empty), so after the assignment field e.length holds d(e).
For the induction, let e be the next hyperedge to be removed from the heap, and assume for all hyperedges h extracted prior to e that h.length holds d(h). Now consider a shortest s,e-hyperpath P, and in the ordering of P given by Lemma 1, let f be the first hyperedge in P that has not yet been removed from the heap. Note that f exists, as e has not been removed yet.
We first show f .key = d(f ) . In the special case where f is the first edge of P, notice d(f ) = ω(f ) by the same reasoning as in the basis. Furthermore f .key = ω(f ) , as f.key starts at ω(f ) , never increases, and cannot decrease below this minimum value. So f .key = d(f ) in this special case.
In the general case where f is not the first edge of P, let g be the in-edge to f on P, and Q ⊆ P be the prefix of P ending in f, as illustrated in Fig. 5. Notice g has already been extracted from the heap (by the definition of f), so g is in f.inedges (as when a hyperedge is extracted, for all its out-edges h it is added to h.inedges). Furthermore Q is a shortest s,f-hyperpath by Lemma 1, so g is on a shortest hyperpath to f. For all hyperedges h extracted before e, by the induction hypothesis h.length = d(h) , and only extracted h add themselves to the field inedges of their out-edges. Hence when g was extracted, added itself to f. inedges, and updated f.key by recovering an s,f-hyperpath, in the s,f-superpath S first found during recovery, all hyperedges h ∈ S had h.length = d(h) . Thus by Lemma 2, the recovered s,f-hyperpath was a shortest hyperpath, so this updated f.key to d(f), and as argued before in the special case, this key will not change. So again f .key = d(f ).
We next show, (1) e.key ≤ f . key In the above, inequality (1) holds since e and f are both on the heap (as f was inserted in the heap either during initialization or when g was extracted), but e is removed before f. Equation (2) is from our prior analysis of f. Inequality (3) holds as Q and P are shortest s,f-and s,ehyperpaths respectively, while Q ⊆ P and edge weights are nonnegative. Lastly, inequality (4) holds since the key of e while it is on the heap is the total weight of some s,ehyperpath. Thus relations (1)-(4) must all be equalities, so e.key = d(e).
We now argue e.length = d(e) after e is extracted. Since e.key = d(e) is the weight of a hyperpath recovered earlier for e, notice (i) there was an in-edge to e on a shortest s,e-hyperpath in e.inedges; moreover (ii) all hyperedges h in the s,e-superpath collected while recovering a hyperpath for e were extracted earlier, and hence by the induction hypothesis had h.length = d(h) . Furthermore, hyperedges are never removed from the field inedges, and h.length never changes after h is extracted. Thus the assumptions in Lemma 2 are still met upon extraction of e, so when ShortestHyperpathHeuristic assigns to e.length the total weight of the hyperpath P recovered for e, by Lemma 2 this recovered P will again be a shortest s,e-hyperpath, hence e.length = d(e) . This completes the inductive proof of our claim.
So for every hyperedge h in the doubly-reachable subgraph explored by ShortestHyperpathHeuristic, after extracting h from the heap, the relation h.length = d(h) holds. Finally, when recovering the best s,t-hyperpath at the end of the heuristic by examining the in-edges e to sink t, for each such hyperedge e the assumptions of Lemma 2 are still met (by the same reasoning as above), so the hyperpaths P obtained from calling RecoverShortHyperpath on these sink inedges e are again shortest s,e-hyperpaths. Since a shortest s,t-hyperpath consists of doubly-reachable hyperedges (by the proof of Theorem 2), and is a shortest s,e-hyperpath for some in-edge e to sink t, the best of these recovered hyperpaths P, which is the hyperpath returned by the heuristic, is a shortest s,t-hyperpath.
Theorem 3 (in combination with Theorem 1) shows that, while Shortest Hyperpaths is NP-complete for singleton-head hypergraphs [14], it is polynomial-time solvable for singleton-tail hypergraphs.

Fig. 5
Hyperpath from the proof of optimality for singleton-tail hypergraphs. Hyperedges inside the dashed circle have been extracted from the heap; those outside have not. The next hyperedge to be extracted is e, and P is a shortest s,e-hyperpath. The first hyperedge of P not yet extracted is f, and Q is the prefix of P up through f

Generating all source-sink hyperpaths
In this section, we give a practical algorithm for generating all s,t-hyperpaths in a given hypergraph for a fixed source s and sink t. In our later experimental results, we use this algorithm on specific source-sink instances from real cell-signaling networks to tractably measure how close our heuristic is to optimal.
In general, the technique of inclusion and exclusion of Hamacher and Queyranne [27] provides a widely-applicable method for generating all the solutions to any combinatorial optimization problem whose feasible solutions are subsets of a ground set-where in our context, hyperpaths are subsets of hyperedges from a hypergraph-but it relies on the ability to efficiently compute a feasible solution that is constrained to include a given in-set and exclude a given out-set. Interestingly, for hyperpaths, Carbonell et al. [20] have shown that just determining whether an s,t-hyperpath exists that contains a specified in-set of hyperedges (regardless of the length of the hyperpath) is already NP-complete. Consequently, we cannot generate all s,t-hyperpaths using the standard inclusion-exclusion technique, as we cannot tractably solve the resulting subproblem that has both in-and outset constraints.
Instead, we generate all hyperpaths through a simple and practical algorithm that only involves out-sets, given in Fig. 6. Function AllHyperpaths returns a list of all s,t-hyperpaths in hypergraph G, leveraging a function OneHyperpath that just has to return one s,thyperpath P in G that does not contain any hyperedges from set Out (so P ∩ Out = ∅ ), or determine that no such hyperpath exists. This constrained hyperpath problem with only out-sets is easy to solve: remove all hyperedges in set Out from G, collect all vertices R and hyperedges F reachable from s in this reduced hypergraph, and if t ∈ R , then find any minimal subset P ⊆ F in which t is still reachable from s; otherwise if t ∈ R , no such hyperpath exists. Function OneHyperpath can efficiently find such an s,t-hyperpath P excluding set Out using repeated calls to ForwardReachable (given earlier in Fig. 2).
Function AllHyperpaths uses a queue of subproblems. A subproblem is described by a pair (Out, Keep) , which corresponds to finding an s,t-hyperpath excluding Out, where any subsequent subproblems that arise from this given subproblem must not exclude any hyperedges from the set Keep (though their solutions are not required to actually use edges from Keep). The purpose of this set Keep is to ensure that all subproblems ever placed on the queue have distinct Out sets. (So any given subproblem described by an out-set is only ever solved once, as argued in the later section on the time complexity of the hyperpath enumeration algorithm in the proof of Theorem 5.) A subproblem that directly arises from a given one we call a child subproblem (as the entire collection of subproblems conceptually forms a tree that is explored breadth-first using the queue). Each child subproblem excludes one edge from the hyperpath found for its parent subproblem; in this way, the children will generate hyperpaths that are distinct from their parent hyperpath, if they have a solution. (Once a subproblem becomes infeasible due to its out-set eliminating any s,thyperpath as a solution, it also does not generate further subproblems.) Though the whole approach never repeatedly solves the same subproblem, in contrast to the inclusion-exclusion technique it can generate the same hyperpath from different subproblems, so we check whether hyperpath P is distinct from those already found before adding it to the list A of all hyperpaths. We first prove this enumeration approach is correct, and then analyze its time complexity.

Correctness of the hyperpath enumeration algorithm
We next show that function AllHyperpaths solves the problem of source-sink hyperpath enumeration.

Proof
For the function AllHyperpaths (in Fig. 6), we view the subproblems it processes as forming a tree: when a problem p is pulled off queue Q and causes a new subproblem q to be put onto Q, these subproblems q comprise the children of p in the tree. Each subproblem is specified by a pair (Out, Keep) , representing the problem of finding an s,t-hyperpath that contains no hyperedge in the set Out. Let P be an s,t-hyperpath satisfying this out-constraint for problem p. Any other s,t-hyperpath P distinct from P that also satisfies the out-constraint for p must not contain some hyperedge in P. (If P contains every hyperedge of P yet is distinct, it is a strict superset of P, contradicting minimality.) Function AllHyperpaths forms the children of p by adding each hyperedge in P to the outset of p for a different child. (So the hyperpaths satisfying the out-constraints of the children are all hyperpaths that both satisfy the constraints of parent p and are distinct from hyperpath P.) Consequently hyperpath P, together with every solution to the children of p, comprise all possible solutions to problem p.
This tree-like process begins at the root with a problem having an empty out-set (whose solutions are all possible s,t-hyperpaths), and continues refining each problem into its children subproblems until reaching the leaves (which have no solution). Thus the set consisting of each hyperpath P found at the nodes of this tree contains all s,t-hyperpaths.
In brief, function AllHyperpaths generates every s,thyperpath. Since it checks for uniqueness, the enumeration algorithm generates every source-sink hyperpath exactly once.

Time complexity of the hyperpath enumeration algorithm
We now bound the running time of function AllHyperpaths in terms of the number of subproblems it solves, and parameters of the input hypergraph.

Proof
We bound the running time of function AllHyperpaths (in Fig. 6) as follows. Solving a given subproblem from the queue by function OneHyperpath (which finds an s,t-hyperpath by iteratively removing hyperedges from the hypergraph and testing reachability to identify a minimal set in which t is still reachable from s), involves at most m calls to function ForwardReachable. A call to ForwardReachable takes O(ℓ) time (by the analysis in the proof of Theorem 1), so solving a subproblem takes O(ℓ m) time. If AllHyperpaths terminates after processing k subproblems, its total time is then O(k ℓ m).
We argue next that the out-sets of subproblems are all distinct. Consider the tree of subproblems processed by AllHyperpaths (as in the proof of Theorem 4), and two arbitrary subproblems x and y in this tree. If one of x and y is a descendant of the other, their out-sets are distinct, as a child always adds a hyperedge to the out-set of its parent. If neither x nor y is a descendant of the other, let subproblem u be their nearest common ancestor, subproblems v and w be the children of u on the paths to x and y respectively, and assume without loss of generality that child v precedes child w. When child v adds hyperedge e to the set Out of its parent u, edge e is not added to set Out for any other children of u, and e is also added to set Keep for all children of u following v, including w.
Furthermore, the set Out for a descendant is a superset of set Out for its ancestors, and set Out for a descendant is always disjoint from set Keep for its ancestors. Consequently, the above hyperedge e is in the out-set of subproblem x but not subproblem y, so their out-sets are again distinct.
Since subproblem out-sets are distinct, k = O(2 m ) . Combining this with the prior total time for hyperpath enumeration yields a worst-case time bound of O(2 m ℓ m).
In practice, typically k ≪ 2 m , so the running time is much faster than the worst-case bound suggests. Function AllHyperpaths can tractably generate all sourcesink hyperpaths for large hypergraphs, as shown in the next section on experimental results, since many of its subproblems quickly become infeasible for real cell-signaling networks.

Experimental results
We now present results from computational experiments on real pathway databases that compare the hyperpath found by our heuristic to the optimal solution. We also remark on the prevalence of biological instances with cyclic shortest hyperpaths, study the cause of suboptimality in our heuristic, report actual running times, and discuss biological examples of cyclic hyperpaths.

Datasets
We evaluate the quality of our heuristic on four datasets built by combining different annotated signaling pathways from two pathway databases, NCI-PID and Reactome. NCI-PID [28] is a curated human-pathway database containing biochemical reactions for complex assembly, cellular transport, and transcriptional regulation. Reactome [29] also contains curated human signaling pathways, and is actively maintained with new reactions being continuously added. We constructed hypergraphs from three subsets of NCI-PID pathways used in Ritz et al. [5], named the Small, Medium, and Large datasets. The Small dataset is a small Wnt signaling pathway consisting of the union of two pathways: "degradation of β-catenin" and "canonical Wnt signaling". The Medium dataset is a larger Wnt signaling pathway including four additional pathways: "noncanonical Wnt signaling", "Wnt signaling network", "regulation of nuclear β-catenin", and "presenilin action in Notch and Wnt signaling", which correspond to non-canonical branches of Wnt signaling. The Large dataset contains all NCI-PID pathways. Similarly, the Reactome dataset is the union of all Reactome pathways. The NCI-PID and Reactome pathways were downloaded in the BioPAX format [30] from Pathway Commons, and processed using a parser from Franzese et al. [22] built on PaxTools [31].
To construct the hypergraphs for each dataset, we mapped each entity (such as a protein, small molecule, and so on) to a vertex in the hypergraph. Each complex was represented as a unique vertex distinct from the entities in the complex. Multiple forms of the same protein map to different vertices denoting compartmentalization and post-translational modifications, such as phosphorylation and ubiquitination. We treated each variant as a distinct entity because many pathways describe the transportation of a protein from one cellular compartment to another, or the marking of a protein for degradation by ubiquitination, necessitating that the corresponding vertices be distinct to reflect these variants. Each reaction was mapped to a hyperedge, where the reactants and positive regulators comprise the tail of the hyperedge, and the products comprise the head. All hyperedges were given unit weight, even though the heuristic handles weighted edges, as NCI-PID is missing reaction rates for some reactions. Table 1 gives statistics on the hypergraphs constructed from each of the four datasets. The hypergraphs are very sparse: there are fewer hyperedges than vertices in all four datasets, with Reactome being even sparser than the NCI-PID datasets. The hypergraphs from the Large and Reactome datasets contain respectively 40 and 433 self-loops, showing that many cyclic hyperpaths are likely to exist. However, a small number of these self-loops are unreachable, due to an otherwise unreachable vertex appearing in both the head and tail of the hyperedge. The sources and targets used in all our experiments are respectively vertices with no in-edges (or vertices whose only in-edge is an unreachable self-loop), and vertices with no out-edges. The number of forward-reachable, backward-traceable, and doubly-reachable hyperedges shows how many hyperedges remain after the heuristic prunes the input hypergraph to the doubly-reachable subgraph before computing a solution. On average, hyperedges from all four hypergraphs have small head and tail sets, and vertices have low in-and out-degree, reflecting the sparseness of the hypergraphs.

Experimental setup
To prepare the hypergraphs from each dataset for our experiments, we parsed the union of the pathways in the dataset. We connected a supersource s to all source vertices-namely, the input vertices with no in-edges-by a single zero-weight hyperedge whose tail consisted of the supersource s and whose head contained all the source vertices. We also included in the head of this hyperedge from supersource s all input vertices whose sole in-edge was a self-loop, since otherwise such a self-loop was not Table 1 Dataset Summaries  traversable. For each specific target vertex v-namely, each input vertex with no out-edges-we had a separate version of the hypergraph that differed only by connecting this target v to a sink t by a single zero-weight ordinary-graph edge directed from v to t, giving us a specific target instance. Note that these choices for the source and target vertices are reasonable, as they are the molecules where biologists stopped annotating a given pathway. Note also that the supersource s and the sink t remain the same across all target instances in a dataset. For each target instance, we trimmed the hypergraph to the doubly-reachable set: the set of hyperedges that were both forward-reachable from supersource s, and backward-traceable from sink t. Table 1 gives the average and maximum size of the forward-reachable, backward-traceable, and doubly-reachable sets over all target instances for a given dataset, which dramatically reduces the size of the hypergraph over which the heuristic performs most of its computation.

Measure
For each target instance, we found a hyperpath from supersource s to sink t using our shortest hyperpath heuristic implemented in the new tool Hhugin [25], and compared its length to the solution of the MILP of Ritz et al. [21] if the heuristic hyperpath was acyclic. For each cyclic target instance where the heuristic output a cyclic hyperpath, we exhaustively enumerated all s,thyperpaths, and compared the heuristic hyperpath to the shortest hyperpath found by this enumeration. (Enumerating all s,t-hyperpaths for one source-sink instance takes on average around 20 hours in practice-so it is not feasible to perform this enumeration on all acyclic target instances.)

Abundance of cyclic hyperpaths
Cyclic shortest hyperpaths appear in all four datasets. To take just one example, in the Small and Medium datasets, the only hyperpath from ubiquitinated β-catenin to APC is cyclic, so for this target instance the acyclic shortest-hyperpath MILP fails to find a solution. Admittedly this particular source-target pair is specially chosen, as ubiquitinated β-catenin has an in-edge and APC has an out-edge so they would not normally be considered under our definition of sources and targets. Nevertheless, this pair demonstrates there do exist cyclic hyperpaths in the NCI-PID database-even in the union of just two pathways-that are missed by the current state-of-the-art when computing only acyclic shortest hyperpaths.
In the Large dataset, 38 target instances have cyclic heuristic hyperpaths. Of these, 22 were cyclic because of a self-loop, and 16 were cyclic due to a non-trivial cycle. For all these instances, no acyclic hyperpath exists between supersource s and sink t. It is likely that even more cycles exist within the hypergraph from the Large dataset, as there were 8 self-loops that were not on any hyperpath found by the heuristic.
In the Reactome dataset, the heuristic found a cyclic shortest hyperpath on 22 target instances, and only one of these instances was cyclic due to a self-loop. In general, Reactome is much sparser than NCI-PID, and 432 of the 433 self-loops in Reactome are never used in a heuristic hyperpath.
The abundance of cyclic hyperpaths in the NCI-PID and Reactome datasets demonstrates the importance of a shortest hyperpath algorithm that properly handles cycles. We discuss concrete examples of biological cyclic shortest hyperpaths in a later section on biological examples.

Quality of the hyperpath heuristic
To determine the quality of our hyperpath heuristic, we compared the length of the heuristic hyperpath to an optimal shortest hyperpath. In general, no practical exact algorithm is currently known for finding a shortest source-sink hyperpath. Consequently, on the target instances where the heuristic found a cyclic hyperpath, we determined the optimum by generating all sourcesink hyperpaths and retaining the shortest one, using our algorithm for hyperpath enumeration. On the target instances where the heuristic found an acyclic hyperpath, we compared its length just to the optimal hyperpath returned by the MILP for shortest acyclic hyperpaths. An even shorter cyclic hyperpath could exist for these latter instances, but finding it by enumerating all hyperpaths is simply too time-consuming to carry out for every such instance. Table 2 summarizes the quality of the heuristic on acyclic instances. On the Small, Medium, and Reactome datasets, the heuristic hyperpath is optimal on all target instances, meaning the heuristic hyperpath and the shortest acyclic hyperpath from the MILP have the same length. On the Large dataset, the heuristic is optimal on over 99% of the instances, demonstrating the quality of the heuristic on these biological datasets. The small fraction of instances where our heuristic was suboptimal are discussed in more detail in the next subsection.    Table 3 summarizes the quality of the heuristic on instances where it output a cyclic hyperpath. On all these cyclic instances, the acyclic MILP failed to find a solution, so we could not compare the heuristic to an optimal hyperpath other than by exhaustively enumerating all hyperpaths and picking the shortest one-which verified that the heuristic on these instances in fact found an optimal solution. Cyclic instances from the Reactome (and Large) datasets contain many distinct hyperpaths, with a median of 22 (respectively 3) hyperpaths, and a maximum of 136 (respectively 364) hyperpaths. The hyperpaths tend to vary in length, with a maximum difference between the length of the longest and shortest hyperpath of 15 (respectively 43) hyperedges, and a median difference of [2,3] (respectively 1) hyperedges. This demonstrates that the heuristic is discriminating between hyperpaths of different lengths and choosing the best hyperpath over worse hyperpaths, further indicating the quality of the heuristic. In every cyclic target instance, all s,t-hyperpaths were cyclic, and many shared a common cycle; most of the hyperedges occurring in one hyperpath but not another appeared outside this shared cycle.

Studying the suboptimality of the heuristic
We call the small number of target instances in these experiments where the heuristic found a known suboptimal hyperpath its suboptimal instances. Table 4 summarizes these 23 suboptimal instances, which are all from the Large NCI-PID dataset, and are all acyclic instances. (The heuristic was optimal on all cyclic instances, and all Reactome, Small, and Medium instances. We mention as well that the maximum values across the table occur in distinct target instances.) To gain insight into why the heuristic found a suboptimal solution on these instances, we enumerated all source-sink hyperpaths for every suboptimal instance.
(This enumeration also verified that on all suboptimal instances, the acyclic MILP in fact found a shortest hyperpath, as there was no shorter cyclic hyperpath.) Hyperpath enumeration confirmed that these suboptimal instances are much harder than the cyclic instances. The median number of hyperpaths is nearly 140 times higher for suboptimal NCI-PID instances compared to cyclic NCI-PID instances, and the length difference between the longest and shortest hyperpaths is 30 times larger. This stark contrast indicates the inherent difficulty of these suboptimal instances, where the heuristic must now discriminate among a much higher number of hyperpaths that have much greater path-length variance. The fraction of all hyperpaths that are optimal is fairly small, with only around 3% being optimal for the median instance. Even faced with many alternate solutions, the heuristic still found a hyperpath that was nearly optimal: the median difference between the length of the heuristic hyperpath and the shortest hyperpath was 1 hyperedge, the maximum difference was 6 hyperedges, and the median ratio of the length of the heuristic hyperpath to the shortest hyperpath was 1.1 (so it was only 10% longer). Next we investigate what could be causing this suboptimality.
The suboptimality of the heuristic is likely coming from the repeated calls to the function RecoverShortHyperpath, which proceeds in two phases. In phase (I), this function recovers an s,e-superpath S, relying on inedge lists to hyperedges f, where the in-edge list for f contains only hyperedges removed from the heap prior to f, which may exclude hyperedges in a shortest s,e-hyperpath. In phase (II), this function trims superpath S to a hyperpath by greedily considering hyperedges in S for removal, which may also remove a hyperedge in an optimal s,e-hyperpath.
To determine whether the recover or trim phases were responsible for suboptimality, we ran the following experiment. After the heuristic determined its estimated path length for every hyperedge in the hypergraph, we called RecoverShortHyperpath on each in-edge to the target where we ran its recovery phase but stopped before its trimming phase, and unioned together the resulting s,t-superpaths from each in-edge to create one large s,tsuperpath F. We then took an optimal s,t-hyperpath P and examined whether P ⊆ F : in other words, whether the recovery phase permitted the heuristic to potentially find an optimal hyperpath. We discovered that for all 23 suboptimal instances P ⊆ F , indicating phase (I) of RecoverShortHyperpath that recovers an s,e-superpath was forcing the heuristic to be suboptimal on every instance. On the other hand the trimming phase of Recover-ShortHyperpath could also be leading to suboptimality, which we investigated as follows. For each suboptimal instance, we modified the recovery phase of Recov-erShortHyperpath to use all in-edges in the hypergraph to each hyperedge, rather than the in-edge lists collected by the heuristic. (In this situation, the recovered superpath F definitely contains a shortest hyperpath P.) Phase (II) then trimmed this superpath as normal. We discovered that the trimming phase often fails to find a shortest hyperpath within this larger superpath (which was the entire doubly-reachable subgraph). This indicates that while phase (I) is definitely causing suboptimality, simply changing phase (I) to recover a larger superpath may in turn lead to suboptimality in phase (II).

Implementation and running time
The heuristic is implemented in Python 2.7.3, comprising around 500 lines of code. The parser used to convert Hyperedges e 1 , e 5 , and e 6 , shown in red, create two separate cycles back to α-catenin and δ-catenin